ABSTRACT

PILGRM (the platform for interactive learning by 
genomics results mining) puts advanced supervised 
analysis techniques applied to enormous gene 
expression compendia into the hands of bench 
biologists. This flexible system empowers its users 
to answer diverse biological questions that are often 
outside of the scope of common databases in a 
data-driven manner. This capability allows domain 
experts to quickly and easily generate hypotheses 
about biological processes, tissues or diseases of 
interest. Specifically, PILGRM helps biologists
generate these hypotheses by analyzing the expres- 
sion levels of known relevant genes in large 
compendia of microarray data. Because PILGRM is 
data-driven, it complements a user's knowledge and 
literature analysis with mining of diverse functional 
genomic data, thereby generating novel predictions 
that can drive experimental follow-up. This server is 
free, does not require registration and is available 
for use at http://pilgrm.princeton.edu. 

INTRODUCTION 

High-throughput genomic data contain information about 
diverse processes, tissues and diseases. The application of 
data-mining algorithms to these large genomic datasets 
provides great potential for uncovering novel biology, 
but currently this potential is not often realized because 
collecting, properly processing and analyzing these data 
requires substantial computational resources and sophisti- 
cated programming knowledge. On the other hand, setting 
up analyses to address important biological questions and 
testing novel predictions resulting from such analyses 
requires detailed experimental knowledge. 

Although there are several successful applications of 
sophisticated computing approaches to diverse functional 
genomics data collections (1-5), including some that share 
results through a web site (6-9), currently there is not an
easy way for a researcher to set up new analyses and ask
specific biological questions by focusing these analyses on 
a sub-process or tissue of interest. This greatly constrains 
the utility of the novel predictions, because direct experi- 
mental validation for some processes or tissues may be 
impractical. PILGRM (the platform for interactive 
learning by genomics results mining) addresses this limi- 
tation by allowing its users to generate specific biological 
hypotheses by directing the supervised analyses of global 
microarray expression collections simply by defining their 
own gold standards (lists of genes relevant to a process, 
disease or tissue). Such an approach puts sophisticated 
computational tools in the hands of biologists, thereby 
combining their biological insight with a powerful compu- 
tational strategy. This flexibility lets users address ques- 
tions as diverse as their research programs while targeting 
predictions to experimentally testable pathways, tissues or 
phenotypes. 

Efforts to predict protein function, expression or local- 
ization from high-throughput data compendia generally 
make computational predictions based on annotations 
from expert-curated literature-derived databases. The 
limited coverage of these databases constrains bioinfor- 
matics strategies that use only database standards. These 
databases also do not represent unpublished experimental 
results that may be informative for future experiments. By 
encouraging and enabling users to define their own stand- 
ards, PILGRM also alleviates this issue of limited 
database coverage. 

However, PILGRM does not eschew these expert-curated,
literature-derived databases. Indeed, as the successful
prior applications of data-mining strategies to
these compendia have shown, these databases have great
value. This is why PILGRM contains extensive collections 
of data and database-derived gold standards (detailed 
in Table 1) for Homo sapiens and the model organisms 
Mus musculus, Rattus norvegicus, Caenorhabditis elegans, 
Arabidopsis thaliana and Saccharomyces cerevisiae. 
We automatically process and integrate many sources 
of gene-annotation in PILGRM. We include the Gene 
Ontology, which has annotations for a protein's biological 

Table 1. PILGRM contains large data compendia and standards
derived from literature-curated databases for each of the organisms
that it covers

Organism      Experiments   Arrays   Genes   Standards   Unique publications
Human         2392          77473    21702   7484        32567
Mouse         2012          31374    24555   6864        14248
Yeast         117           1801     6077    4231        10134
Arabidopsis   408           5465     22121   3929        6836
Rat           440           10376    21416   5242        14395
Worm          53            963      17027   1782        2489

The unique publications column shows how many distinct publications
are represented in the gold standards pre-loaded in PILGRM for each
organism. This table shows the status of these collections as of 31
January 2011.



process involvement, localization and biochemical 
function (10,11), the Plant Ontology, which has annota- 
tions for a protein's role in plant development and anat- 
omy (12), the Saccharomyces Genome Database 
phenotype annotations, which specify phenotypes 
observed when genes are knocked out (13) and the 
Human Protein Reference Database's Tissue annotations, 
which provide literature-derived annotations of tissue 
specific expression, localization and function for human 
proteins (14). We are adding new databases as they are 
requested by users. These database annotations provide a 
convenient starting point for user-defined standards and 
analyses. 

For example, a researcher studying breast cancer pro- 
gression may be interested in identifying novel candidate 
genes involved in breast cancer progression while avoiding 
genes that appear relevant simply because they are ex- 
pressed in mammary epithelium (i.e. genes discoverable 
by a simple correlation analysis). This researcher can 
take advantage of both custom standards and the 
included database annotations in PILGRM. Setting up 
such an analysis without PILGRM would require that 
he download the full collection of over 70 000 gene expres- 
sion experiments for human, develop appropriate data 
processing, normalization and integration methods, and 
set up a machine-learning framework for the analysis. 
He would then have to download the HPRD database 
to identify genes known to be expressed in the 
mammary epithelium, in addition to creating his custom 
standard of genes involved in breast cancer progression. 

In contrast, this analysis takes minutes in PILGRM: 
Figure 1 shows the steps that this user performs during 
the preparation and interpretation of this analysis. First, 
the researcher develops a gold standard for genes involved 
in breast cancer through his own expertise and a literature 
search (Figure 1A). The PILGRM server allows each link 
between a gene and a gold standard to be associated with 
PubMed identifiers and these publication-annotated links 
are included in a downloadable document (PDF format) 
describing each analysis that is made available to the user 
(such a document can be used for additional record 
keeping by the users, to inform a Materials and methods 
section, or directly as Supplementary Data in publications
resulting from this analysis). Second, he creates an
analysis and pairs this breast cancer standard with the
HPRD mammary epithelium standard included in
PILGRM (Figure 1B). As a final step, the researcher
runs this analysis and both metrics for the machine- 
learning results and novel predictions are returned by 
PILGRM. 
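
As a rough illustration of the standard built in Figure 1A, the sketch below parses a pasted gene list in which each line holds a gene symbol optionally followed by PubMed identifiers. The function name `parse_gene_list` and the whitespace-separated input format are assumptions made for this example; they are not PILGRM's documented input format or code.

```python
from typing import Dict, List


def parse_gene_list(text: str) -> Dict[str, List[str]]:
    """Map each gene symbol to the PubMed IDs listed after it.

    Hypothetical format: one gene per line, symbol first, then zero or
    more whitespace-separated PubMed IDs. Blank lines are ignored.
    """
    standard: Dict[str, List[str]] = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        symbol, pmids = fields[0].upper(), fields[1:]
        bad = [p for p in pmids if not p.isdigit()]
        if bad:
            raise ValueError(f"non-numeric PubMed ID for {symbol}: {bad}")
        # Merge repeated symbols so one gene can carry several citations.
        standard.setdefault(symbol, []).extend(pmids)
    return standard


# Symbols and IDs follow the Figure 1A example; the last gene is given
# without an ID here only to show that the identifier is optional.
example = """
BRCA1 2270482
BRCA2 8091231
TP53 1373310
RAD51
"""
breast_cancer_standard = parse_gene_list(example)
```

Keeping the publication identifiers alongside each gene is what would let a report list the evidence behind every gene-to-standard link.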

By combining custom standards with appropriate
literature-curated databases and the Support Vector
Machine (SVM) classifier implemented in PILGRM, this
researcher discovers genes relevant to his research and
saves time without compromising the flexibility or
quality of his data-driven predictions. These relevant
genes behave similarly (e.g.
through co-expression) to the genes defined as interesting 
by the user (the positive standard) in informative experi- 
mental conditions. The machine-learning approach auto- 
matically identifies the conditions that best differentiate 
positive standard genes from those in the negative 
standard (genes with properties that the user wishes to 
avoid in new predictions). PILGRM provides both novel 
predictions and high-quality interactive visualizations of 
analysis results for the researcher to explore. 

PILGRM's main features are as follows. 

(i) A flexible interface that encourages user-defined 
data-driven analyses that answer diverse questions 
of biological interest including those outside the 
scope of common databases. 

(ii) Regularly updated compendia of uniformly pro- 
cessed genomic data for human and common 
model organisms. 

(iii) Regularly updated gold standards for tissue, function 
and development from common sources (GO, PO, 
HPRD, etc.) that make setting up analyses quick 
and easy. 

(iv) User-set levels of access control (public, hidden, 
private) for standards and analyses, allowing users 
to include unpublished results in PILGRM. 



SYSTEM DESCRIPTION 

Each PILGRM analysis begins with an important bio- 
logical question defined by the user. The user translates 
this question into appropriate gold standards, thereby 
defining the corresponding machine-learning problem. 
Gold standards are structured as positives (which repre- 
sent genes like those that the user is seeking) and negatives 
(which represent genes with properties the user wants to 
exclude) and can be drawn from databases or developed 
by the user. These standards are added to an analysis that 
is run by the user. PILGRM then classifies all other genes 
in the organism of interest with a machine-learning algo- 
rithm that employs the user-provided positive and 
negative standards, thereby generating novel predictions. 
This process is summarized in Figure 2 and discussed in
detail in Supplementary Data S1.
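
The classification step described above can be sketched in a few lines. The example below is a minimal illustration only: it trains a linear SVM by plain subgradient descent on the hinge loss, which is a stand-in for (and not the same as) the SVMperf implementation that PILGRM uses, and all gene names, expression values and hyperparameters (`lam`, `rate`, `epochs`) are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def decision_value(w: Vector, b: float, x: Vector) -> float:
    """Signed score of x relative to the hyperplane w.x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_linear_svm(rows: List[Vector], labels: List[int],
                     lam: float = 0.01, rate: float = 0.1,
                     epochs: int = 200) -> Tuple[List[float], float]:
    """Full-batch subgradient descent on the L2-regularized hinge loss.

    labels must be +1 (positive standard) or -1 (negative standard).
    """
    n, dim = len(rows), len(rows[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]
        grad_b = 0.0
        for x, y in zip(rows, labels):
            if y * decision_value(w, b, x) < 1.0:  # inside the margin
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wi - rate * g for wi, g in zip(w, grad_w)]
        b -= rate * grad_b
    return w, b


def rank_unlabeled(w: Vector, b: float,
                   unlabeled: Dict[str, Vector]) -> List[Tuple[str, float]]:
    """Rank genes outside both standards, most positive-like first."""
    scored = [(g, decision_value(w, b, x)) for g, x in unlabeled.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy expression values over two arrays; gene names are placeholders.
positives = {"pos_1": [2.0, -1.0], "pos_2": [1.5, -0.5]}
negatives = {"neg_1": [-1.0, 2.0], "neg_2": [-0.5, 1.5]}
unlabeled = {"gene_x": [-1.0, 1.0], "gene_y": [1.0, -1.0]}

rows = list(positives.values()) + list(negatives.values())
labels = [1] * len(positives) + [-1] * len(negatives)
w, b = train_linear_svm(rows, labels)
predictions = rank_unlabeled(w, b, unlabeled)
```

In this toy setting `gene_y`, whose expression pattern resembles the positive standard, is ranked ahead of `gene_x`. Ranking by the raw decision value gives the same order as ranking by signed distance to the hyperplane, because the two differ only by the constant factor 1/||w||.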

In addition to novel predictions, the user is provided 
with interactive visualizations of standard quantitative 
metrics for evaluating results of classification algorithms 

[Figure 1 image: panel (A) "Create a Standard" with "Add a List of Genes"; panel (B) "Create an Analysis" ("Breast Cancer vs. Mammary Tissue") with "Add standards to the analysis" and "Top Novel Predictions"]

Figure 1. The flow of a PILGRM analysis that uses one custom standard and a pre-loaded standard to discover genes related to breast cancer
progression while excluding general mammary epithelium genes. (A) The process of creating a standard and adding genes (here shown with optional
PubMed IDs) to it. (B) The process of setting up and running an analysis. The breast cancer standard from (A) is combined with the HPRD
mammary epithelium standard that is pre-loaded into PILGRM. The breast cancer standard is a positive and the mammary epithelium is a negative
(here both are shown together for clarity). The analysis is run and standard quantitative performance metrics and novel predictions are provided to
the user.



including the area under the curve (AUC), a figure 
showing the precision-recall trade-off, and a figure com- 
paring the true positive rate and false positive rate (shown 
in Figure 3A). PILGRM provides this high-quality results 
visualization with cross-platform JavaScript that is 
accessible without proprietary plugins in all modern web 
browsers. JavaScript also allows for interactive figures 
that provide additional information on mouseover (as 
with the mouseover display of genes from each standard 
shown in Figure 3B). This interactivity allows researchers 
to more fully understand how each gene in a standard is 
classified. Users also have the option of including valid- 
ation standards that are also shown on this figure. 
Validation standards are not used for classification and 
can be used to highlight genes of interest or to further 
assess prediction quality. All of these result figures can be
exported to JPG, PNG, SVG or PDF for easy inclusion in
reports and publications. Additionally, the web server is 
capable of producing a document for each analysis that 
provides a detailed explanation of the methods, data and 
results specific to a user's analysis. This document is 
formatted as a PDF and is intended as Supplementary 
Data for molecular biology manuscripts informed by a 
PILGRM analysis. 
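
For readers unfamiliar with these metrics, the following sketch shows one common way to compute the area under the ROC curve and the precision-recall points from scored genes. It is an illustration under simple assumptions (held-out scores are already available; the function names and toy scores are hypothetical) and does not reproduce PILGRM's own evaluation code.

```python
from typing import List, Sequence, Tuple


def auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Area under the ROC curve as the fraction of (positive, negative)
    pairs in which the positive gene scores higher; ties count as half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def precision_recall(scored: List[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Return (recall, precision) after each gene, best score first."""
    ordered = sorted(scored, key=lambda item: item[0], reverse=True)
    total_pos = sum(1 for _, is_pos in ordered if is_pos)
    points, true_pos = [], 0
    for k, (_, is_pos) in enumerate(ordered, start=1):
        true_pos += int(is_pos)
        points.append((true_pos / total_pos, true_pos / k))
    return points


# Toy held-out scores for genes in the positive and negative standards.
pos = [0.9, 0.6, 0.4]
neg = [0.7, 0.3, 0.2]
area = auc(pos, neg)  # 7 of the 9 pairs are ordered correctly
curve = precision_recall([(s, True) for s in pos] + [(s, False) for s in neg])
```

An AUC of 0.5 corresponds to the random classifier drawn as a reference line in ROC plots, and 1.0 to a perfect separation of the two standards.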

Our server employs SVMs for classification. Specifically,
we use the linear SVM implementation from SVMperf (15).
We have evaluated other implementations (including polynomial
and RBF kernels), and the linear SVM offers classification
performance that is better than or comparable to that of
more complex forms, often at substantially faster speed (16).
Our server handles running the analysis, parameter selection
and cross-validation. The analyses are run on a

[Figure 2 image: a "Web Server" panel showing user input (positive and negative standards) and top novel predictions, and a "Computing Cluster" panel showing genes plotted by expression, with axis label "Array 1"]

Figure 2. (A) This diagram shows the flow of each PILGRM analysis. We pre-process separate datasets into a gene-expression compendium for each
organism. (B) The user provides positive and negative standards (either input by the user or from common databases) and the data are labeled with
these standards. The SVM algorithm identifies the maximum-margin hyperplane (here a dotted line for the two arrays in this example, but in practice
this is a plane in very high-dimensional space) that best separates the positive (red) and negative (blue) standards by gene expression. Unlabeled genes
(black) are then ranked by their distance to this plane (C), and the ranked list is returned to the user as predictions (D). The user is also provided
detailed evaluation plots based on cross-validation (Figure 3).



high-performance computing cluster in the Lewis-Sigler 
Institute for Bioinformatics at Princeton University. 

PILGRM currently contains data and standards for six 
organisms (human and the model organisms yeast, worm, 
mouse, rat and arabidopsis as detailed in Table 1) and 
additional organisms are added upon request. The data 
are processed uniformly and in a manner robust to 
diverse platforms and many experimental biases. As an 
example, for Affymetrix data compendia all supplemen- 
tary CEL files available in the Gene Expression Omnibus 
(17) are downloaded, and their probes are mapped to Entrez
GeneIDs using the Entrez BrainArray CustomCDF (18).
All arrays are processed within their experiment (GEO 
series) using the affy (19) R package from Bioconductor 
(20). Expression values are summarized with the 
medianpolish (21) method after RMA background correc- 
tion (21) and quantile normalization (22). At this point, 
experiment sets with five or fewer arrays are combined 
into a single set of arrays. Genes are then normalized 
within experiments and combined for learning using our 
open-source C++ Sleipnir library for computational func- 
tional genomics (23). 
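
To make two of these processing steps concrete, the sketch below shows a basic quantile normalization across arrays and a per-gene standardization within one experiment. It is only an illustration: PILGRM's pipeline uses the affy R package and the Sleipnir library, the text does not specify the exact within-experiment normalization (a z-score is assumed here), and this simple version breaks ties by sort order rather than averaging them.

```python
from statistics import mean, pstdev
from typing import List


def quantile_normalize(arrays: List[List[float]]) -> List[List[float]]:
    """Give every array the same value distribution.

    arrays holds one list of expression values per array, all in the
    same gene order. Each value is replaced by the mean, across arrays,
    of the values that share its rank.
    """
    n_genes = len(arrays[0])
    sorted_cols = [sorted(col) for col in arrays]
    rank_means = [mean(col[r] for col in sorted_cols) for r in range(n_genes)]
    normalized = []
    for col in arrays:
        order = sorted(range(n_genes), key=lambda g: col[g])
        out = [0.0] * n_genes
        for rank, gene_index in enumerate(order):
            out[gene_index] = rank_means[rank]
        normalized.append(out)
    return normalized


def zscore_gene(values: List[float]) -> List[float]:
    """Standardize one gene's values across the arrays of one experiment
    (assumed form of the within-experiment normalization)."""
    mu, sd = mean(values), pstdev(values)
    return [0.0 if sd == 0 else (v - mu) / sd for v in values]


# Two toy arrays measuring the same three genes.
raw = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(raw)
```

After quantile normalization both toy arrays contain exactly the values 1.5, 3.5 and 5.5, so differences between arrays reflect gene ranks rather than overall intensity.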

Data compendia are updated monthly through an 
automated but supervised pipeline. Each new analysis 
is assigned to the current organism-specific data compen- 
dium when it is created. When data are updated, existing 
analyses are not affected. Users can, at the granularity of 
individual analyses, elect to have PILGRM re-perform
their exact analysis using the most current data
compendium. 
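
One simple way to model this behavior is to store a compendium version with each analysis and to change it only on an explicit re-run. The sketch below is a hypothetical data model written for illustration (the class names and version strings are invented); it is not PILGRM's actual implementation.

```python
from dataclasses import dataclass, replace
from typing import Dict


@dataclass(frozen=True)
class Analysis:
    name: str
    organism: str
    compendium_version: str  # fixed when the analysis is created


class CompendiumRegistry:
    """Tracks the current compendium version for each organism."""

    def __init__(self) -> None:
        self._current: Dict[str, str] = {}

    def publish(self, organism: str, version: str) -> None:
        self._current[organism] = version

    def create_analysis(self, name: str, organism: str) -> Analysis:
        return Analysis(name, organism, self._current[organism])

    def rerun_on_latest(self, analysis: Analysis) -> Analysis:
        latest = self._current[analysis.organism]
        return replace(analysis, compendium_version=latest)


registry = CompendiumRegistry()
registry.publish("yeast", "2011-01")
original = registry.create_analysis("dna-damage", "yeast")
registry.publish("yeast", "2011-02")  # a later data update
refreshed = registry.rerun_on_latest(original)
```

Because `Analysis` is immutable, a data update cannot silently alter an existing analysis; the re-run produces a new record tied to the newer compendium.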

Because experimenters can include their own unpub- 
lished experimental results as part of their custom gold 
standards and because PILGRM predictions are used to 
direct follow-up bench experiments, PILGRM offers 
multiple levels of access control. Analyses may be com- 
pletely public, which allows anyone to view the analysis. 
These are shown in lists of analyses on the site. Analyses 
may also be hidden. Hidden analyses and standards are 
not shown in lists on the site and are accessible only 
through a user-defined web address. With registration, 
analyses may be made private. This is the highest level of 
protection and prohibits access by anyone other than the 
analysis owner. Registration is simple, completely optional 
(the only PILGRM capability that needs registration is 
making analyses completely private) and requires only a 
username, working email address, and password. 
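
The three access levels can be summarized as a small decision rule. The sketch below encodes the behavior described above; the names `Access`, `is_listed` and `can_view` are hypothetical and chosen for this example, not taken from the PILGRM code base.

```python
from enum import Enum
from typing import Optional


class Access(Enum):
    PUBLIC = "public"    # listed on the site, viewable by anyone
    HIDDEN = "hidden"    # unlisted, reachable only via its web address
    PRIVATE = "private"  # viewable only by the registered owner


def is_listed(level: Access) -> bool:
    """Only public analyses and standards appear in site-wide lists."""
    return level is Access.PUBLIC


def can_view(level: Access, owner: str, viewer: Optional[str],
             has_address: bool) -> bool:
    """Decide whether a visitor may view an analysis.

    viewer is None for unregistered visitors; has_address is True when
    the visitor arrived through the user-defined web address.
    """
    if level is Access.PUBLIC:
        return True
    if level is Access.HIDDEN:
        return has_address
    return viewer is not None and viewer == owner
```

Under this rule a hidden analysis relies on the secrecy of its address, whereas a private analysis requires the owner to be logged in, which is why registration is needed only for the private level.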

PILGRM provides step-by-step tutorials for creating 
standards and running analyses. Optional example input, 
which builds a hypothetical analysis of genes relevant 
to breast cancer but not mammary tissue in general 
(one step of this analysis is shown in Figure 4), is 
provided for these tutorials. Standards and analyses 
created during the tutorials can then be used outside of 
the tutorial framework. 

The PILGRM server is a flexible tool that biologists can 
use to develop data-driven predictions of gene properties 

[Figure 3 image: panel (A) ROC curve; panel (B) interactive plot of score against rank (axis label "Rank") with legend entries GO:0000278 mitotic cell cycle, GO:0007049 cell cycle, GO:0006974 response to DNA damage stimulus and GO:0000077 DNA damage checkpoint]

Figure 3. An example of figures produced by PILGRM. (A) The true positive rate at various false positive rates for the case study of yeast
DNA-damage repair. The area under the curve, shown in blue, is 0.7189 for this analysis and the performance of a random classifier is shown
by the grey line. (B) An illustration of how PILGRM figures are highly interactive. In this visualization, the rank and score from PILGRM are plotted for
each gene in the positive (red) and negative (green) standards. Moving the mouse over a point shows which gene it represents. Clicking on a standard
toggles it between shown and hidden (here GO:0006974 has been hidden).



directly relevant to their experimental questions in 
under an hour. Regularly updated data compendia and 
database-derived gold standards ensure that PILGRM
remains current. Its user-defined access control lets 
researchers include unpublished findings to iteratively 
improve prediction quality without compromising novel 
findings. PILGRM gives expert biologists a chance to 
use their expertise to mine large-scale genomic compendia
quickly and easily. 

CASE STUDY: YEAST DNA-DAMAGE REPAIR 

PILGRM's capabilities are perhaps best illustrated in a 
case study. This case study represents a researcher
interested in identifying novel candidate genes that are
involved in DNA-damage repair while excluding genes 
only generally related to cell cycle control. The first step 
of a PILGRM analysis is to determine what the positive 
and negative standards should be. The positive standard 
should represent DNA-damage repair genes. In this case, 
the researcher uses a PILGRM-provided positive standard 
of yeast genes with direct experimental annotations to 
GO:0006974 (response to DNA-damage stimulus) and
GO:0000077 (DNA-damage checkpoint). The negative 
standard should represent cell-cycle-related genes. She 
elects to use a negative standard containing yeast genes 
with direct experimental annotations to GO:0000278 
(mitotic cell cycle) and GO:0007049 (cell cycle); this 

[Figure 4 image: tutorial screenshot ("Work with a Standard Step 7") showing the "Add Individual Genes" search table (Official Symbol, Organism, Aliases, Description) filtered to BRCA1, a PubMed ID field and the "Add Selected Genes" button]

Figure 4. PILGRM contains step-by-step tutorials that familiarize users with the system. Optional example input is provided for each tutorial.
The optional example represents an analysis of breast cancer progression that avoids genes that appear relevant simply because they are expressed in
mammary epithelium.



standard is also included in PILGRM (as are all GO-based 
standards). Although in this case study the analysis uses 
only standards from the Gene Ontology's biological 
process ontology, researchers are free to customize these 
standards or add additional ones for their own analyses. 

The researcher runs the analysis using PILGRM's yeast
gene expression compendium, which consists of all
S. cerevisiae expression (GDS) datasets from GEO. The
PILGRM data processing pipeline (invisible to the user)
has already done all the pre-processing for this analysis:
the supplied probe identifiers were mapped to Entrez identifiers;
each array was normalized with a Fisher Z-transform;
and genes were normalized within experiments and
combined for learning using our Sleipnir library for
computational functional genomics (23). In total, this
compendium of S. cerevisiae GDS datasets from GEO contains
1801 arrays from 117 different experiments covering 6077
Entrez gene identifiers as of 31 January 2011.

She then can interactively interpret the results of her 
analysis. She sees an AUC visualization and is informed 
that the area under the curve for this analysis is 0.7189 
(Figure 3A). She also can examine the list of novel predic- 
tions, with link-outs to appropriate model organism data- 
bases to provide gene-specific information for each 
prediction. In this case, the top novel prediction is the 
gene YMR090W, which SGD (24) lists as a putative 
protein with unknown function. This gene is not essential 
(25) and is up-regulated in response to the fungicide 
mancozeb in a proteome-wide screen (26). Mancozeb 
has been shown, in rats, to induce single strand breaks 
in a dose-dependent manner (27). Thus, in this case 
study PILGRM discovers a potentially relevant gene not 
previously associated with DNA-damage repair that has 
promising experimental support. Such an analysis would take
a researcher a total of 15 min to perform using PILGRM,
including all analysis setup and definition of gold standards.
This complete analysis is available at
http://pilgrm.princeton.edu/analysis/view/case-study-yeast-dna-damage-response/.

DISCUSSION 

PILGRM is a user-friendly exploratory tool for expert 
biologists who wish to use current knowledge and 
genome-wide experimental data to guide the design of 
future experiments. The extensive pre-loaded data collec- 
tions and literature-based standards from common data- 
bases make it easy for researchers to start using the system. 
PILGRM is being actively developed, and we will continue 
adding capabilities based upon user requests. Currently we 
are working to include RNA-Seq data and developing an 
interface to allow users to perform an analysis on a 
user-defined subset of the data compendium. This web
server's flexibility allows biologists to customize analyses
that address specific questions of interest within diverse
topics such as protein function, tissue-specific gene expression
and cellular localization by employing computing
approaches for data-driven generation of accurate hypotheses.
PILGRM thus brings sophisticated machine-learning
methods applied to enormous gene expression
compendia into the lab of any researcher, enabling data- 
driven experiment direction complementary to traditional 
knowledge-based discovery provided by existing databases. 

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online. 